Data Visualization (2017/18)

Solutions for Assignment 4 - Visualizing tabular data with many features

Presented by Group 17:

  • Srie Raam Mohan
  • Eshan Mushtaq

Date: 06.01.2018

Due date: Sunday, 07. Jan. 2018, 22:00

Please hand in a copy of the solved notebook and an HTML export of it.

Remark: To keep the file size low, the notebook contains png images of solutions you should obtain. Double-click on the images to see the markdown code - do not execute these cells, as you do not have the images.

In [1]:
import pandas as pd
import numpy as np

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

from bokeh.models import ColumnDataSource, ColorBar, LinearColorMapper, CategoricalColorMapper
from bokeh.models import Arrow, NormalHead, LabelSet, Label
from bokeh.plotting import figure, output_notebook, show
from bokeh.palettes import Category10, Category20, Viridis
from bokeh.transform import factor_cmap, linear_cmap
from bokeh.io import export_png
from bokeh.layouts import gridplot, row
from bokeh.core.properties import value

output_notebook()
Loading BokehJS ...

Exercise 1: Characterizing car types

In [2]:
# load the data
cars = pd.read_csv( '93cars.dat.csv', sep='\s+', na_values='*')

# substitute missing values
cars.LuggageCapacity = cars.LuggageCapacity.fillna(cars.LuggageCapacity.median())
cars.Cylinders = cars.Cylinders.fillna(cars.Cylinders.median())
cars.RearSeatRoom = cars.RearSeatRoom.fillna(cars.RearSeatRoom.median())
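The median imputation used above can be illustrated on a toy column (a minimal sketch with made-up numbers, not the actual cars data):

```python
import numpy as np
import pandas as pd

# Toy column with one missing value, standing in for e.g. LuggageCapacity
col = pd.Series([11.0, 13.0, np.nan, 15.0, 14.0])

# The median is computed over the non-missing values only (here: 13.5)
filled = col.fillna(col.median())

print(filled.tolist())  # → [11.0, 13.0, 13.5, 15.0, 14.0]
```

The median is preferred over the mean here because it is robust against outliers in skewed columns.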
In [ ]:
cars.columns

**Task 1a)** Understand your data.

  • Which of the variables are quantitative and can be used in a PCA? Store their names in the following variable var.
    • MinPrice, MidPrice, MaxPrice, CityMpg, HighwayMpg, Cylinders, Engine, Horsepower, RPM, EngineRev, Tank, Passenger, Length, Wheelbase, Width, UTurn, RearSeatRoom, LuggageCapacity, Weight
In [3]:
var = ['MinPrice','MidPrice','MaxPrice', 'CityMpg', 'HighwayMpg', 'Cylinders', 'Engine', 'Horsepower', 'RPM', 'EngineRev','Tank', 'Passenger','Length', 'Wheelbase', 'Width','UTurn','RearSeatRoom','LuggageCapacity','Weight']
vars_group1 = ['MidPrice', 'Horsepower', 'CityMpg', 'Weight']
scale_factor, position_factor = 7.0, 8.0
  • Which variables can be used to divide the cars into groups? How many groups do you obtain and what size do they have? Do you expect differences between the groups and if yes which?

    • Variable: MidPrice, CityMpg, Horsepower, Weight
    • Number of groups + size of each group: 3 groups
      • Group 1: large share - MidPrice and CityMpg
      • Group 2: small share - Horsepower and CityMpg
      • Group 3: medium share - average MidPrice and Horsepower
    • Hypothesis: Group 1 is likely the largest share, containing cars that are affordable to the general public. Group 2 is a considerably smaller share of costly cars owned by few people. Group 3 is of medium size and bridges the gap between the affordable and the luxurious cars.
In [4]:
# store standardized data in cars_std
cars_std = StandardScaler().fit_transform(cars[var])
cars_group = StandardScaler().fit_transform(cars[vars_group1])

# store PCA in variable pca
pca = PCA(n_components=cars_std.shape[1])
pca.fit(cars_std)

# store PCA of the grouped variables in pca_group
pca_group = PCA(n_components=cars_group.shape[1])
pca_group.fit(cars_group)
Out[4]:
PCA(copy=True, iterated_power='auto', n_components=4, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

Important variables for further analysis (see docu above):

In [ ]:
pca.components_
pca.explained_variance_
pca.explained_variance_ratio_
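As a quick reference for what these attributes contain, here is a sketch on a small random matrix (not the cars data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(3)
X = rng.normal(size=(50, 4))  # 50 samples, 4 features

pca_demo = PCA().fit(X)

# components_: one unit-length principal axis per row, shape (n_components, n_features)
print(pca_demo.components_.shape)  # → (4, 4)

# explained_variance_ratio_: each axis' share of total variance; sums to 1 when all components are kept
print(round(pca_demo.explained_variance_ratio_.sum(), 6))  # → 1.0
```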

**Task 1c)** How many principal components to use.

In [5]:
var_exp = pca.explained_variance_ratio_*100
cum_var_exp = np.cumsum(var_exp)
x = ['PC%s' %(i+1) for i in range(len(var))]

source = ColumnDataSource( dict(x=x, var_exp=var_exp, cum_var_exp=cum_var_exp) )

p = figure( plot_width=520, plot_height=400, toolbar_location=None, x_range=x, y_range=(-2,105), title="Cumulative Variance for PC Axis in Percentage" )

p.vbar( source=source, x='x', top='var_exp', width=0.9, bottom=0, legend='Explained variance' )

p.circle( x, cum_var_exp, color='orange', size=5, legend="Cumulative explained variance")
p.line( x, cum_var_exp, color='orange', line_width=2 )

p.legend.location = (235,155)
p.legend.border_line_color = None
p.xgrid.visible = False
p.xaxis.axis_label = "Principal Component"
p.yaxis.axis_label = "Explained variance in percent"

show(p)

Here is a sample image to check your routine:

  • How much variance do the first and the first two principal component(s) explain, roughly?
    • Answer: PC1 explains roughly 60% of the variance, whereas PC1 & PC2 combined explain about 76% of the variance.
  • How many components do you need to explain 90% of variance in the data (roughly, estimate from the figure)?
    • Answer: We need five principal components to explain 90% of the variance in the given data.
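Instead of reading the count off the figure, the number of components for a variance threshold can also be computed directly; a sketch on synthetic correlated data (the exact count for the cars data is not reproduced here):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Six columns built from three latent factors, so ~3 components carry most of the variance
base = rng.normal(size=(100, 3))
data = np.hstack([base, base + 0.1 * rng.normal(size=(100, 3))])

pca_demo = PCA().fit(StandardScaler().fit_transform(data))
cum = np.cumsum(pca_demo.explained_variance_ratio_)

# First index at which the cumulative explained variance reaches 90%
n_needed = int(np.searchsorted(cum, 0.90)) + 1
print(n_needed)  # → 3
```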

**Task 1d)** Interpret the axis.

The following chart presents a projection of the cars data onto the first two principal components. Use techniques you learned in class to derive the meaning of the axes.

In [6]:
pca_auto = pd.DataFrame( pca.transform(cars_std), columns=['PC%i' % (i+1) for i in range(pca.n_components_)])

pca_auto['label'] = cars.Type
pca_auto['label'] = pca_auto['label'].astype('str')

factors = sorted(pca_auto.label.unique())

source = ColumnDataSource(pca_auto)

p = figure( plot_width=600, plot_height=600, y_range=(-4.5,4.8), title="PCA Biplot with all Quantitative Variables")

p.circle( source=source, x='PC1', y='PC2', size=9, legend='label',
          color=factor_cmap('label', palette=Category10[10], factors=factors))

feature_vectors = pca.components_.T
for i, v in enumerate(feature_vectors):
        p.add_layout(Arrow(end=NormalHead(fill_color="orange",size=10,line_width=2), x_start=0, y_start=0, x_end=scale_factor*v[0], y_end=scale_factor*v[1]))
        p.add_layout(Label(x=position_factor*v[0], y=position_factor*v[1], text=var[i], render_mode='canvas', background_fill_color='white', text_font_size='10pt'))
p.xaxis.axis_label = 'PC1' 
p.yaxis.axis_label = 'PC2'
p.legend.location = "bottom_right"

show(p)
  • Explain the PC1 and PC2 axis. What is typical for cars on the left vs. right (x-axis)? What is typical for cars located towards the top vs. bottom of the chart (y-axis)?

    • The PC1 axis explains the most variance in the data, and PC2 the next most. From the plot we can see that RPM, EngineRev and the Mpg variables are positively correlated with PC1, whereas the remaining variables are negatively correlated with it.
    • Comparing left vs. right along the x-axis: Small and Compact cars lie in the direction of the Mpg arrows and therefore have good mileage, whereas the Midsize and Large cars on the other side of the x-axis do not. We can also see that EngineRev is high for Sporty cars but low for Vans and Large vehicles.
    • Along the y-axis, price varies together with LuggageCapacity and the other size dimensions of the vehicle, while Tank, Cylinders, Engine and Weight contribute little to this axis. For example, as the number of cylinders grows, the weight of the vehicle increases as well.
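The arrow directions in the biplot come from pca.components_; that these loadings really reflect the correlations between the original variables and the PC scores can be checked on synthetic data (a sketch with made-up "size" and "mpg" stand-ins, not the cars table):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(1)
size = rng.normal(size=200)
# Two "size-like" features and one anti-correlated "mpg-like" feature
X = np.column_stack([size + 0.1 * rng.normal(size=200),
                     size + 0.1 * rng.normal(size=200),
                     -size + 0.1 * rng.normal(size=200)])

X_std = StandardScaler().fit_transform(X)
scores = PCA(n_components=2).fit_transform(X_std)

# Correlation of each original variable with PC1: same-sign variables
# point the same way in the biplot, opposite signs point the other way
corr_pc1 = [np.corrcoef(X_std[:, j], scores[:, 0])[0, 1] for j in range(3)]
print(np.sign(np.round(corr_pc1, 3)))
```

The first two entries share a sign and the third is opposite, matching the arrow geometry (the overall sign of a principal axis is arbitrary).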
In [7]:
pca_auto = pd.DataFrame( pca_group.transform(cars_group), columns=['PC%i' % (i+1) for i in range(pca_group.n_components_)])

pca_auto['label'] = cars.Type
pca_auto['label'] = pca_auto['label'].astype('str')

factors = sorted(pca_auto.label.unique())

source = ColumnDataSource(pca_auto)

p = figure( plot_width=600, plot_height=600, y_range=(-8,8), title="PCA Biplot with Grouped Variables")

p.circle( source=source, x='PC1', y='PC2', size=9, legend='label',
          color=factor_cmap('label', palette=Category10[10], factors=factors))

feature_vectors = pca_group.components_.T
for i, v in enumerate(feature_vectors):
        p.add_layout(Arrow(end=NormalHead(fill_color="orange",size=10,line_width=2), x_start=0, y_start=0, x_end=scale_factor*v[0], y_end=scale_factor*v[1]))
        p.add_layout(Label(x=position_factor*v[0], y=position_factor*v[1], text=vars_group1[i], render_mode='canvas', background_fill_color='white', text_font_size='10pt'))
p.xaxis.axis_label = 'PC1' 
p.yaxis.axis_label = 'PC2' 
p.legend.location = "bottom_right"

show(p)

You should get the following chart:

  • Use one of the groupings you discussed in Task 1a) and color the plot accordingly. What can you tell about the groups? Feel free to comment on additional findings here.
    • We can see from the biplot that CityMpg is good for Small cars, whereas the opposite holds for Sporty cars. The negative correlation with Mpg grows in the order Small, Compact/Midsize, Large/Van, Sporty.
    • CityMpg and Horsepower are negatively correlated: the more horsepower a car has, the fewer miles per gallon it gets. The same holds for Weight, since heavy vehicles consume more fuel.
    • MidPrice and Horsepower vary together, suggesting that more horsepower comes with a higher price. Sporty and Midsize cars lie in the high-price, high-horsepower range.

Exercise 2: Analyzing tumor data

The following code reads the tumor data and does some preprocessing. You have access to the following variables:

  • df contains the tabular data with genes and labels. Labels are stored in the last column
  • labels the last column as a separate array
  • df_std a standardized version of the data
  • ngenes the number of genes
  • cancer_types list of cancers contained in the data
In [8]:
# read the gene expression data, transpose it, and name the columns
df = pd.read_csv('nci.csv', header=None, sep='\s+')
df = df.T
df.columns = ['G%i' % i for i in range(len(df.columns))]

ngenes = len(df.columns)

# read the labels and add them to the dataframe
labels = pd.read_csv( 'nci-labels.csv' )
df['cancer'] = labels

# remove synthetic tumors
df = df[df['cancer'].isin(['CNS', 'RENAL', 'BREAST', 'NSCLC', 'UNKNOWN', 'OVARIAN', 'MELANOMA',
       'PROSTATE', 'LEUKEMIA', 'COLON' ])]
df = df.reset_index()
labels = df.cancer

# get a standardized version of the data
df_std = StandardScaler().fit_transform(df.iloc[:,1:-1])

# get the occurring cancer types
cancer_types = sorted(labels.unique())
print(cancer_types)
['BREAST', 'CNS', 'COLON', 'LEUKEMIA', 'MELANOMA', 'NSCLC', 'OVARIAN', 'PROSTATE', 'RENAL', 'UNKNOWN']

Helper routines

In [9]:
def getLegend( df ):
    df = df.sort_values(by=['cancer'])
    source = ColumnDataSource(df)

    l = figure( plot_width=200, plot_height=300, y_range=(100,101) )
    l.circle( source=source, x='G0', y='G1', size=7, legend='cancer', color=factor_cmap('cancer', palette=Category10[10], factors=cancer_types))
    l.xaxis.visible = False
    l.yaxis.visible = False
    l.xgrid.visible = False
    l.ygrid.visible = False
    l.legend.location = "top_left"
    l.legend.border_line_color = None
    
    return l
In [10]:
df.cancer.value_counts()
Out[10]:
RENAL       9
NSCLC       9
MELANOMA    8
COLON       7
BREAST      7
LEUKEMIA    6
OVARIAN     6
CNS         5
PROSTATE    2
UNKNOWN     1
Name: cancer, dtype: int64

**Task 2a)** Make a feasibility check.

  • Show 6 scatterplots of your data using random genes for the axes.
In [11]:
plots = []  # renamed from `row` to avoid shadowing bokeh.layouts.row imported above

def getRandomGenePlot( df ):

    p = figure( plot_width=300, plot_height=300 )
    ids = np.random.choice( len(df.columns)-1, 2 )
    var1 = 'G'+str(ids[0])
    var2 = 'G'+str(ids[1])
    p.circle( source=df, x=var1, y=var2, size=7,
          color=factor_cmap('cancer', palette=Category10[10], factors=cancer_types))
    p.xaxis.axis_label = var1
    p.yaxis.axis_label = var2
    return p

for i in range(6):
    plots.append(getRandomGenePlot(df))

plots.append(getLegend(df))
    
show( gridplot([[plots[0], plots[1], plots[2]], [plots[3], plots[4], plots[5]], [None, None, plots[6]]]))

Should give:

  • Looking at the charts you created above, can you find clusters for the different cancer types?
    • No - finding clusters from random genes is not feasible. A single gene, or a pair of genes, is not enough to characterize a cancer type, so no clusters emerge in these plots.
  • Explain the following chart (also understand the code to compute it).

    Hint: The method np.linalg.norm( x-y ) computes the Euclidean distance between two vectors x and y.
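A quick check of the hint on two toy vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0])
y = np.array([0.0, 0.0, 0.0])

# Euclidean distance: sqrt(1^2 + 2^2 + 2^2) = 3
print(np.linalg.norm(x - y))  # → 3.0
```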

In [13]:
cnt = pd.DataFrame( index=cancer_types, columns=['same', 'other']).fillna(0)


for i in df_std:
    dfc = pd.DataFrame( {'d': [np.linalg.norm(i-j) for j in df_std], 'l': labels} ).sort_values(by='d')
    # the closest row is the sample itself (distance 0), so compare its label
    # (with ==, not `is`) to the label of the nearest other sample
    if dfc.iloc[0,1] == dfc.iloc[1,1]: 
        cnt.loc[dfc.iloc[0,1],'same'] += 1
    else: 
        cnt.loc[dfc.iloc[0,1],'other'] += 1
        
p = figure( y_range=cancer_types, plot_height=350 , title="Gene Similarity data for Cancer Types")
p.hbar_stack( ['same','other'], y='index', color=['steelblue','orange'], height=0.9, source=cnt,
              legend=[value(x) for x in ['same','other']])
p.legend.location = 'bottom_right'
show(p)
  • The code above computes the Euclidean distance between each sample and all other samples in gene expression space, and checks whether the closest other sample carries the same cancer label. This shows how similar the expression profiles of the different cancer types are.
  • The chart tells us for each cancer type how many of its samples have their nearest neighbour within the same type ('same') and how many are closest to a sample of another type ('other').
  • In summary, what do you think? Is it possible to classify different types of cancer using gene expression data? Which cancer types are probably very hard to describe which are easy? Why?
    • It is possible to classify cancer types to some extent, because samples of the same cancer type often share a similar gene expression profile. Therefore grouping and classification of cancer types should be feasible.
    • Unknown and Prostate cancer are difficult to describe because their samples do not share their gene profile with samples of the same type (and there are only one and two of them), whereas e.g. the Colon samples mostly resemble each other and should be comparatively easy to classify.
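The 'same vs. other' count above is essentially a leave-one-out 1-nearest-neighbour check; the same idea can be sketched with sklearn's NearestNeighbors on two made-up, well-separated "cancer types" (purely synthetic, not the NCI data):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(2)
# Two synthetic classes, 10 samples each, clearly separated in 5 dimensions
X = np.vstack([rng.normal(0, 0.3, size=(10, 5)),
               rng.normal(3, 0.3, size=(10, 5))])
labels = np.array(['A'] * 10 + ['B'] * 10)

# k=2: the nearest hit is the sample itself, so column 1 is the nearest *other* sample
nn = NearestNeighbors(n_neighbors=2).fit(X)
_, idx = nn.kneighbors(X)
same = labels[idx[:, 1]] == labels

print(same.all())  # → True (every sample's nearest neighbour shares its label)
```

For real tumor data the clusters overlap, so some samples end up in the 'other' count, as the chart shows.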

**Task 2b)** Compute the PCA projection and visualize the first two components in a scatterplot. Add outlines to each cancer type to make analysis easier.

Hint: scipy provides a convex hull implementation https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.ConvexHull.html. hull.vertices returns a list of sorted boundary vertices.
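A minimal check of the ConvexHull API from the hint, on hypothetical toy points:

```python
import numpy as np
from scipy.spatial import ConvexHull

# Four corners of a square plus one interior point
pts = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.5, 0.5]])
hull = ConvexHull(pts)

# hull.vertices lists the boundary point indices in counter-clockwise order;
# the interior point (index 4) is not part of the hull
print(sorted(hull.vertices.tolist()))  # → [0, 1, 2, 3]
```

Because the vertices are ordered along the boundary, they can be fed directly to a patch/polygon glyph to draw an outline.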

In [14]:
# compute pca of data df_std
pca = PCA(n_components=60)
pca.fit(df_std)

# create a dataframe of the projected data
pca_data = pd.DataFrame( pca.transform(df_std), columns=['PC%i' % (i+1) for i in range(pca.n_components_)])
pca_data['label'] = labels
In [15]:
from scipy.spatial import ConvexHull

source = ColumnDataSource(pca_data)

# color mapper. Usage: mapper['LEUKEMIA'] return the colorname of the given cancer type.
mapper = dict(zip(cancer_types, Category10[10]))

p = figure( plot_width=600, plot_height=600, toolbar_location=None, title= "Convex Hull graph for Cancer Types" )

# render the convex hulls first
for c in cancer_types:
    # get the data for the current cancer type
    data = pca_data[pca_data.label == c].reset_index()
    pc_data = data[['PC1','PC2']].values
    if pc_data.shape[0] < 3:
        print(c + " type has less than 3 points to determine a convex hull")
        continue
    convex_hull = ConvexHull(pc_data).simplices
    for simplex in convex_hull:
        p.line(pc_data[simplex, 0], pc_data[simplex, 1], color=mapper[c], line_width=2)


p.circle( source=source, x='PC1', y='PC2', size=9, 
          color=factor_cmap('label', palette=Category10[10], factors=cancer_types))
p.xaxis.axis_label = 'PC1' 
p.yaxis.axis_label = 'PC2' 

        
show( gridplot( [[p, getLegend(df)]] ) )
PROSTATE type has less than 3 points to determine convex hull
UNKNOWN type has less than 3 points to determine convex hull

Sample solution:

  • Identify tumor classes (+ explain briefly):
    • Cluster -
      • Answer: Colon. The Colon samples form a visible, distinct cluster: they share similar gene data with each other, and part of this expression pattern is unique to the Colon cancer type.
    • No compact cluster -
      • Answer: NSCLC. From the plot we can see that the NSCLC region overlaps with the Ovarian, Renal, Breast and CNS types. Its gene data is similar to all of these cancer types, so there is no unique, distinguishable pattern to identify it separately.
    • Hard to distinguish -
      • Answer: Prostate and Unknown. These types contribute only two and one sample(s), respectively, so there are not enough points to form a cluster or a meaningful representation in the first two PC axes.

**Task 2c)** How confident should you be in the results?

In [16]:
n = 60

var_exp = pca.explained_variance_ratio_[:n]*100
cum_var_exp = np.cumsum(var_exp)
x = ['PC%s' %(i+1) for i in range(n)]

source = ColumnDataSource( dict(x=x, var_exp=var_exp, cum_var_exp=cum_var_exp) )

p = figure( plot_width=520, plot_height=400, toolbar_location=None, x_range=x, y_range=(-2,105), title = "Cumulative Variance for PC Axis in Percentage" )

p.vbar( source=source, x='x', top='var_exp', width=0.9, bottom=0, legend='Explained variance' )

p.circle( x, cum_var_exp, color='orange', size=5, legend="Cumulative explained variance")
p.line( x, cum_var_exp, color='orange', line_width=2 )

p.legend.location = (235,155)
p.legend.border_line_color = None
p.xgrid.visible = False
p.yaxis.axis_label = "Explained variance in percent"

show(p)
  • Take a look at the provided explained variance plot. How reliable is your analysis based on the first two components?
    • The first two PC axes cover only about 18-19% of the variance of the whole gene expression data, so conclusions about clusters and classification of the cancer types are not very reliable. Still, since these axes capture the two largest directions of variation, they at least give a bird's-eye view of the gene data.